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Abstract 

This project aims to explore which combinations of meteorological condi¬ 
tions are associated with extreme ground level ozone conditions. Our approach 
focuses only on the tail by optimizing the tail dependence between the ozone 
response and functions of meteorological covariates. Since there is a long list 
of possible meteorological covariates, the space of possible models cannot be 
explored completely. Consequently, we perform data mining within the model 
selection context, employing an automated model search procedure. Our study 
is unique among extremes applications as optimizing tail dependence has not 
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previously been attempted, and it presents new challenges, such as requiring a 
smooth threshold. We present a simulation study which shows that the method 
can detect complicated conditions leading to extreme responses and resists over¬ 
fitting. We apply the method to ozone data for Atlanta and Charlotte and find 
similar meteorological drivers for these two Southeastern US cities. We iden¬ 
tify several covariates which help to differentiate the meteorological conditions 
which lead to extreme ozone levels from those which lead to merely high levels. 


1 Introduction and Motivation 


Ground level ozone (O 3 ) is known to be detrimental to the human respiratory system 


(US EPA, 2006, Section 5.2). Research indicates that acute exposure to ozone can 


lead to a decline in lung function and increased inflammation (US EPA, 2006, Sec¬ 


tion 7.2.8). Bell et ah (2004) report that there is a relationship between increases in 


ozone and mortality in urban areas. Wilson et ah (2014) find that the effect of ozone 


is non-linear and that extremely high levels of ground level ozone could be especially 
harmful. For these reasons, it is important to understand what are the contributing 
factors which lead to the most extreme ozone levels. 

Ozone is a secondary pollutant, created via a chemical reaction which occurs when 
nitrogen oxides (NOx) and volatile organic compounds (VOCs) are exposed to ultra¬ 
violet radiation from sunlight. The meteorological drivers which are associated with 
high ozone levels (high temperature, low wind speed, high solar radiation) are well 


known (Jacob and Winner, 2009). However, it is less well known what meteorological 


conditions distinguish an extreme ozone day from one with merely high ozone levels. 
The left panel of Figure [T] partially illustrates this idea. Ground level ozone is plotted 
versus air temperature for Atlanta, Georgia from 1992 to 2010 (April-October). This 
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scatterplot shows that extreme levels of ozone occur when the air temperature is high; 
however, days with the highest ozone readings do not correspond to the days with the 
highest temperatures. Motivated by a larger US EPA funded project which aims to 
understand how atmospheric chemistry models represent extreme ozone, this study 
aims to better understand the meteorological drivers of extreme ozone. 

We aim to hud functions of meteorological covariates that have a high degree of tail 
dependence with ground level ozone. That is, we want to hnd functions of covariates 
which tend to be very large when ground level ozone is extreme. As is typical for an 
extreme value (EV) analysis, we only analyze data which are considered to be extreme 
and disregard that which is non-extreme. Our approach consists of two linked tasks. 
The hrst is an optimization problem: for a specihc set of covariates (which may be 
transformed from or functions of the original meteorological covariates), we want 
to hnd the coefficients in the linear combination of meteorological covariates that 
optimize tail dependence with ozone. The second is a data mining problem: we aim 
to hnd which of many possible meteorological covariates are associated with extreme 
ozone conditions. Here, data mining is a model selection problem where the model 
space is too large to search exhaustively. We perform data mining in a series of steps, 
which concludes with an automated search of the model space. 

Importantly, our goal is not prediction of ozone levels. Ozone prediction is best 
done by atmospheric chemistry models which capture the known physics and chem¬ 
istry in terms of the diherential equations which underlie these models. Our motiva¬ 
tion is that current models tend to poorly predict the most extreme events. Thus our 
goal is exploratory: by focusing only on extreme events, we aim to extract a signal 
between the extreme ozone responses and the associated meteorological conditions. 
When performing the second task of model selection, we will not limit our attention 
to the one model which hts best, but instead will explore the characteristics and com- 
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monalities of a number of the best fitting models. We do this for a couple of reasons. 
First, we do not believe that the true relationship between meteorology and ground 
level ozone lies within the model space we are able to explore. Second, we believe it 
is likely that there is more than one formula of meteorological conditions which can 
lead to extreme ground level ozone. 

We model the bivariate relationship between a linear combination of (functions 
of) covariates and ozone via the framework of bivariate regular variation. Dehned 
only in terms of the joint tail, multivariate regular variation allows one to model mul¬ 
tivariate threshold exceedances, thus focusing only on extreme behavior. We employ 
a bivariate tail dependence measure to quantify the dependence between the function 
of covariates and the response, and this measure is optimized subject to a constraint 
which imposes a marginal condition required by the regular variation framework. Our 
study is quite different in aim from a typical multivariate extremes study. The goal 
of most multivariate extremes analyses is to assess risk, and the quantity of interest 
is the estimated probability of an extreme event occuring simultaneously for multiple 
responses. We are unaware of any previous work which use extremes methods to 
optimize dependence or perform data mining. 

When modeling a response in terms of covariates, it is common to consider a type 
of regression analysis. Standard linear regression models the expected response (and 
thus the conditional distribution’s center), and consequently it tends to be a poor 
method for describing extremes. Approaches such as quantile regression or logistic 
regression can be tailored to focus on large values of the response. Our EV approach 
focuses on only the most extreme values of the response and is fundamentally different 
from regression approaches. However, one can make an analogy between our approach 
and standard least-squares regression: standard regression aims to hnd the linear 
combination of covariates which optimizes correlation with the response (in terms of 
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/^-squared), and our approach aims to find the coefficients which optimize a measure 
of limiting tail dependence. 

Our method also differs from conditional or regression approaches for extremes 


(Beirlant et ah, 2004, Ch. 7) which model the parameters of a univariate extremes 
model (e.g., generalized extreme value (GEV) or generalized Pareto distribution 
(GPD)) as functions of covariates. Gonditional models are typically applied when 
the covariate is measured on a longer time scale (e.g. annual) than the response, thus 
allowing the researcher to extract data (e.g. annual maxima or threshold exceedances) 
which are considered extreme for the particular covariate value. Gonditional mod¬ 
els for extremes have been used in atmospheric science studies to study trends by 
conditioning on year, or to study the relationship with slowly-evolving climatological 


regimes (Sillmann et ah, 2011 Maraun et ah, 2011). In contrast, Reich et ah (2013) 
model daily ozone levels conditional on a daily covariate and include an extremes 
model for the tail, but they do so by modelling the entire distribution and limiting 
their investigation to a single covariate. Our approach is specifically designed for 
covariates which vary on the same time scale as the response. Importantly, there is a 
subtle difference between the questions answered by a conditional extremes approach 
and our proposed approach. Because it models the tail conditional on the covariates, 
the conditional approach answers the question “Given certain conditions, what is the 
extreme behavior?” By optimizing tail dependence, our approach answers a slightly 
different question of “What conditions are most strongly associated with the extreme 
observations?” 

This paper is organized in the following manner. In Section we review the con¬ 
cepts of bivariate regular variation and tail dependence. We discuss tail dependence 
parameters and their estimators, and discuss why some estimators are better suited for 
optimization. We introduce tail dependence estimators that utilize a smooth thresh- 
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old. In Sectionj^we present our procedure. Section|^s simulation study demonstrates 
the ability of our approach to detect complicated conditions which lead to extreme 
behavior. In Section we apply a multistep data mining procedure to data from 
Atlanta, Georgia and Charlotte, North Carolina, and list meteorological covariates 
which exhibit a relationship with extreme ozone levels. 


Air Temp Vs. Ozone (Atlanta, GA) 


Transformed Air Temp Vs. Ozone 


Quantile versus Gamma Hat 



Daily High Surface Air Temp (Celcius) 


Unit Frechet Transformed Air Temp 


Quantile 


Figure 1: (L) A scatterplot of daily high surface air temperature versus peak daily 
maximum eight-hour surface ozone in Atlanta, Georgia from 1992 to 2010 (April- 
October). (C) A scatterplot of the same variables after transforming to the unit 
Frechet scale via rank transformations. (R) A plot of 7 , as dehned in Equation ([^, 
for a sequence of quantiles for air temperature and ozone with 95% conhdence bands. 


2 Extremes and Dependence in the Tail 

Multivariate regular variation implies that the joint tail decays like a power function. 
Since the framework is defined only in terms of the joint tail, it is useful for model¬ 
ing threshold exceedances. Importantly, the modeling framework of regular variation 
allows for asymptotic dependence. If {X, Y) is a bivariate random vector with com¬ 
mon marginals, then X and Y are asymptotically dependent if x = hm^j_,.a,+ P{Y > 
M I X > u) > 0 , where is the right endpoint of the marginal distribution’s sup¬ 
port. Most multivariate models such as those with Gaussian dependence structure 
and most copula models do not allow for asymptotic dependence. Like other models 
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for multivariate extremes, multivariate regular variation makes assumptions about 
the marginal distributions in order that dependence in the tail can be described. 


2.1 Bivariate Regular Variation and Exceedances 

A random vector Z G [0, oo)^ is regularly varying if there exists a sequence h{n) such 
that P(||Z’|| > h{n)) ~ n~^ and 


nP 


b{n) 


^ / \ 


( 1 ) 


where v denotes vague convergence on E = [0, oo]^\{0} and || ■ || is any norm (Resnick 


2007). A useful polar coordinate representation follows by dehning a radial compo¬ 
nent, R = IIZ’II and angular component, W = \\Z\\^^Z. Z = RW is regularly 
varying if 


nP (b{n) ^R > r,W G B) ^ r as n —)■ cx). 


( 2 ) 


where S' = {2 G E ; || 2 || = 1} is the unit sphere under any norm, and 77 is a finite 
measure for any 77—continuity Borel subset B of S. Because the right hand side 
of (|^ is a product measure, the radial and angular components become independent 
in the limit. We can characterize the tail behavior via u, or alternatively via a and 
the angular measure 77. 77 contains all dependence information, and b{n) can be 
chosen so that 77 is a probability measure. 

Proposition 5.10) shows that monotone transformations of the 
univariate marginal distributions do not change the fundamental nature of the tail 
dependence, in the sense that the domain of attraction is preserved. Thus, the frame¬ 
work can be used to model multivariate data with differing tail behavior, which may 
or may not be heavy-tailed. If the data come from Y which is not regularly varying. 


Resnick (1987 
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we assume that there exist probability integral transformations Ti {i = 1, 2), such that 
TiiYi) = Zj, and Z = (Zi, Z 2 )^ is regular varying. Statistical practice for multivariate 
extremes typically requires transforming marginals to a common, convenient distri¬ 
bution under which the dependence structure is more easily described. When dealing 
with real data, an analyst will typically transform each margin using Tj, where Tj 
is a probability integral transformation based on an estimated distribution function. 
We choose to transform our marginals to the unit Frechet distribution which has cdf 
G{z) = P(Zj < z) = exp {— 2 ;“^}. The unit Frechet distribution is regularly varying 
with a = 1, therefore our transformed data will be very heavy-tailed. Since a = 1, 
we hnd it convenient to use the Li norm: ||Z||i = Zi + Z 2 , as this implies FT is a 
probability measure if h{n) ~ 2n. 

Let Z = (Zi, Z 2 )^ be a bivariate regular-varying a = 1 random vector with com¬ 
mon marginal distributions. H can be thought of as a probability measure on [0,1]. 
Informally, as dependence increases, the mass of H concentrates toward the center 
(i.e., 1/2), and consequently large realizations of Z will tend to occur closer to the 
45-degree line. As dependence decreases, the large observations of Z will tend to oc¬ 


cur near the axes. If Zi and Z 2 are asymptotically independent (Ledford and Tawn 


1996), H has mass of .5 at {0} and {1}. When exploring data, a scatterplot after 


transformation to a heavy tailed distribution can help one to understand tail depen¬ 
dence. The plot in the center panel of Figure [T] gives a scatterplot of air temperature 
versus ozone in Atlanta, where both variables are transformed to have unit Frechet 
marginals via rank transformations. Due to the heavy tail, the majority of the points 
have congregated near the origin, and one is left to view the behavior of the large 
points. As many of the large points occur near the axes, it indicates that the level 
of tail dependence is relatively weak; that is, temperature alone does not describe 


extreme ozone conditions. 




2.2 Tail Dependence Summary Parameters and Estimators 

Within the regular variation framework, tail dependence is completely described by v 
or H. However, neither of these quantities is easily summarized. In order to perform 
a numerical optimization, we must summarize tail dependence with a single number. 

Several summary measures given in terms of v have been proposed. For example, 
if Z is bivariate regularly varying, 

, C)o] X [u, CX)]) 

X = liKi — -^- - -—. 

u-^ao 1/{[U,00\ X 0, CX) ) 


An estimator of y can be obtained by selecting a high u and replacing z/ with the 
observed counts: 


X{u) = 




(3) 


Coles et al. (1999) give a similar estimator for x which is symmetric in the sense that 


it also considers Zi conditioned on Z 2 exceeding u. Other tail dependence summary 
measures based on 1 / with similar estimators have been proposed; the extremogram of 


(Davis and Mikosch, 2009) can be viewed as a generalization of y applied in the time 
series context. Tail dependence summary measures with ‘counting estimators’ similar 
to ([^ cannot serve as the objective function in numerical optimization. In (|^ for a 
hxed u, x{u) gives the number of points that exceed u in both components divided 
by the number of points that exceed in the first component. In our optimization 
method, any function of meteorological covariates which results in the same number 
of exceedances would yield the same value of xiu) regardless of the exceedances’ 
values, causing the optimization to fail. 


Alternatively, summary measures can be derived from H. Larsson and Resnick 
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(2012) consider dependence snmmary parameters of the form 


p^= K{w)dH{w) 

J[o,i] 


( 4 ) 


for bounded and continuous k : [0,1] —)■ [0, cxd). Larsson and Resnick (2012) propose 
estimators and show consistency by relying on the intermediate asymptotics common 
to EVT. Let k := k{n) be a sequence such that k ^ oo and k/n ^ 0, and let 






t=l 


where b is defined as in rt2h It can be shown that z>„ —)■ z/ (Resnick 


2007 


Theorem 4.1) 


on the space of positive Radon measures on [0, cxd]^\{0}. Then, for sets in [0,1], dehne 




!>„{z I ||z|| > l,Zl||z|| ‘ e ■} _ El.l(Ai/')) " 


II 


e ■ 


l>n{z 


> 1 } 


V” A(hard) / _ll 

U( 


II 


(n/fc) 


where (^r) = I{z > 1}. The superscript denotes a ‘hard’ threshold at 1 which 

is standard for extremes. Let Wt = Zt^i/\\Zt\\. We obtain the estimator for the tail 
dependence measure 


-(hard) _ 
r K\n 


'[ 0 , 1 ] 




^(hard) 


11^41 

b{n/k) 


k{W) 


V"' A (hard) / 11^*11 
Z^t=l^ \b{n/k) 


(5) 


Because z>„ A it follows that 


(hard) 


H and A p (iResnickl 20041). Resnick 


(2007, p. 301) further states that if h{n/k) is replaced with an estimator h{n/k), and 
b{n/k)/b{n/k) A 1, the above convergences hold. 

We use a dependence summary parameter based on |Zi—Z 2 I, the Li distance to the 

similarly used the Li difference of 


45-degree line. The madogram (Cooley et al 
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variates, but assumed Z was max-stable rather than regular-varying. After converting 
to psuedo-polar coordinates, \Z\ — Z< 2 \ = \RW — R{1 — W)\ = R\2W — 1|. We require 
a function that is not dependent upon R, so we choose k{w) = | 2 rc — 1 |, noting 


Zi — Z2 

Z\ -|- Z 2 


R\2W-1 

R 


|2W-1 


Define the parameter 


7 = 


'[ 0 , 1 ] 


\2w — l\dH{w). 


Note that 0 < 7 < 1, and that a smaller value of 7 implies a higher degree of tail 
dependence. When 7 = 1 we have asymptotic independence, whereas 7 = 0 shows 
perfect tail dependence. As in Resnick ( |2004 ), we can define an estimator of 7 using 
the estimator of H, 


7n 


'[ 0 , 1 ] 


\2w-l\Hi^^'^‘^\dw). 


( 6 ) 


The right panel of Figure shows estimates of 7 „ between the ozone response and 
temperature for increasing thresholds. As 7 ^ achieves a level of about 0.72 shows 
that these two variables appear to exhibit asymptotic dependence, but the level of 
dependence is relatively weak. 


2.3 Tail Dependence Estimation with a Smooth Threshold 


As we wish to perform optimization, the hard threshold typically used in EVT is 
problematic as points would move back and forth across the threshold during opti¬ 


mization, likely not allowing the optimizer to converge. Chaudhuri and Solar-Lezama 


( 2011 ) propose using ‘smooth interpretations’ of discontinuous functions in numeric 


optimization. We use these techniques for EVT by replacing the hard threshold, 
^(hard)^ with a Smoothed one, which gradually increases the weights from 0 
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to 1 as the radial component increases. 


Let 



where jg ^ non-decreasing function which converges pointwise to on 

(0,1) U (l,oo). In the supplementary materials, we give sufficient conditions on 
^^smooth) ^^smooth) ^ Consistency of 


(smooth) 

in 



follows from the weak convergence of to H. 

The conditions on qj-q related to its convergence rate. In the regular 

variation framework, points pile up near the origin at rate n. Thus, near the origin, 
^^smooth) converge to zero quickly enough to negate this effect. Away from the 

origin, jg allowed to converge to more slowly. In practice, since n is 

hxed, the convergence rate of jg irrelevant. Typical of extremes procedures, 

in equations ([^ and (|^ b{n/k) is replaced by a suitably chosen threshold r. 


3 Procedure for Investigating Extreme Behavior 

The focus of our project is not parameter estimation as discussed in Section We 
now develop the two parts of our method to hnd linear combinations of functions of 
covariates which are associated with extreme behavior. We restrict our attention to 
linear combinations as we feel the model space would become unsearchable otherwise. 


12 












As the sample size is fixed in the remainder of this work, we omit the n subscript 

c r(smooth) 1 ^ (smooth) i r, 

irom On and % hereaiter. 

3.1 Optimizing Tail Dependence 

Assume for now that we work with a specific set of covariates. We aim to find the 
linear combination which optimizes tail dependence between these covariates and the 
response variable in terms of response at time t G N be given by 

the continuous random variable Yt (on its original scale), and let the A;—dimensional 
random vector of continuous covariates be Xt = (Wp, • • • 

To make use of the bivariate regular variation framework, we would like the re¬ 
sponse variable and the linear combination of covariates to have regularly varying 
marginal distributions. We can easily transform Yt to be approximately unit Frechet. 
We define Y;* = G-i[Fy(Ft)] where Fy is an estimated stationary marginal distribu¬ 
tion of Yt and G is the unit Frechet distribution function. 

To ensure that our linear combination has a marginal which is approximately unit 
Frechet, we do a two-step transformation procedure. We first transform each covariate 
to the iV(0,1) scale using a probability integral transformation, 

where <F represents the Gaussian distribution function with mean 0 and variance 1. 
Define the vector = {X ^^,..., letting S* denote its covariance matrix. We 

investigate functionals of the form 


Xy/3 = ftA'p + ... + AA'p,, (9) 

noting that E[X'l'^j3] = 0 and Var[X)"'^/3] = For identifiability, we constrain 

/3 such that f3'^T,*f3 = 1. 

The second transformation ensures our function of the covariates is approximately 
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unit Frechet. For optimization purposes, we assume that X^'^f3 is approximately 
normal. We then apply the transformation 




( 10 ) 


Employing the Gaussian cdf in (10) results in smooth behavior of the objective func¬ 


tion in the optimization procedure. If we were to use a rank-based transformation 
as we do for the response or in the hrst stage, the objective function would have a 
discontinuous jump at values of /3 where the ordering of (3 changes. 

Dehne 


E ”' X(smooth) z'i ^ (d)~F** I 


(3* = argmin 

{/3eK'=:/3l’S*/3=l} 


We use ^(smooth) ^ as our weight function, where a determines the amount 

of smoothness and tq is a selected threshold. 

Even with the implementation of the smoothed threshold, the optimization is non¬ 
trivial. Given k covariates, the constraint = 1 makes the optimization k — 1 

dimensional. One can perform optimization directly on /di ,... ,(3k using augmented 
Lagrangian methods available in the ‘alabama’ package (Varadhan, 2011) in R (|^ 


Development Gore Team, 2011), but we found that this optimization required good 


starting values. After transforming to standard polar coordinates: (3i = rcos6i, ..., 
(3k-i = r cos^fc-i, (3k = r ^Ilti constrained in terms 

of 0 = ( 01 ,..., Ok-iY' which has box constraints 6j G (0 , tt) for j = 1,2,..., k — 2 
and 0fc_i G (0,27r). We found that the differential evolution algorithm in ‘DEoptim’ 


(Mullen et al., 2011) and the generalized simulated annealing algorithm in ‘GenSA’ 


(Yang Xiang et ah, 2013) performed equally well in optimizing 6 . 


14 




























3.2 Model Comparison and Model Search 

To compare the model fits corresponding to different covariate sets, we need a criterion 
which evaluates each model’s level of tail dependence. To protect against overhtting, 
we use a rather standard 10-fold cross-validation (CV) procedure; however, our CV 
score is based on the tail dependence metric 7 rather than a typical mean-squared- 
error metric, which is poorly suited for extremes. Given a particular set of covariates, 
we hrst obtain Y^* and (for i = 1,..., /c) as described in Section 3. We then ran¬ 
domly partition these transformed observations into 10 equally sized subsets. Let the 
observation numbers corresponding to thepartition be given by Tp C {1,2,... ,n}. 
Similarly, let the indices corresponding to all observations except the partition be 
given by r_p = { 1 , 2 ,..., n}\rp. At the iteration (for p = 1 ,..., 10 ), we obtain 
the parameter estimates using all observations except those in the p*^ partition: 


^(-P) = 


argmm 


E X(smooth)I 

) X**if3)+Y** 


^(smooth) ('X”(/3) + Y^* 


After obtaining f3^ we calculate 7 (-p) by applying 0^ to the held out data: 


Eter + F;*) 




Eter S(^^°°^^'>iX**{|^) + Y^**) 


After iterating over the 10 partitions, we calculate CV = 10“^ Ei=i7(-p)- 

Since it is impossible to fit and compare all potential models, we require a method 
to search the model space to identify good-htting models. To describe the model 
space, we consider binary strings oj in the space { 0 , 1 }’’, where r is the total number 
of covariates including interactions and transformations. In these strings, a 1 (or 
0 ) in the position (for i = l,...,r) indicates the presence (or absence) of the 
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covariate. Thus, u; corresponds to the unique representation of one particular 
model. We denote the CV value for the model represented by uj with CV{(jj), and 
we wish to hnd models for which CV{(jj) is small. Thus, we frame variable selection 
as a combinatorial optimization problem. To search the model space we employ 
simulated annealing, which is known to give good solutions in these types of problems 


(Kirkpatrick et ah, 1983). 


We implement a slightly modified version of the simulated annealing procedure 
utilized in the R optim function (using the SANN method). We begin with an initial 
value, ujq G {0,1}^ and rely on a function / : {0,1}^ —?• {0,1}^' to choose a new 
string. We can construct / to exclude certain undesired models, such as models with 
highly correlated covariates. We describe the function / we use in our analysis in the 
Supplementary Materials. 

At the step we compare CK(/(ci;j_i)) to CK(n;j_i). > CV{f{u:j-i)) 

then we define loj := f{LOj_i) and proceed to the next iteration. If < 

then 


■= 


with probability exp{—ACV^/Temp^} 
ijjj-i with probability 1 — exp{—ACI^/Temp^} 


where Tempj is the current global temperature in the simulated annealing process 
and ACVj = CV{f{ujj_i)) — The global temperature is a parameter that 

is lowered throughout the optimization according to a cooling schedule. As in the 
R function optim using the SANN method, we use the logarithmic cooling schedule 


outlined in Belisle (1992). When the global temperature is high the process is more 


likely to move to n;s with higher CV values, reducing the chances of hnding a local 
optimum. When the global temperature is low, the process is unlikely to move to ujs 
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with higher CV values. Atypical for simulated annealing, our goal is not necessarily 
to hnd the unique global optimum, but to identify several models with very good 
scores and investigate their commonalities and differences. 

4 Simulation Study 

4.1 Description of Simulated Data 

We randomly generate 5,000 independent realizations of hve covariates, Xi, X 2 , X 3 , X 4 , X 5 . 
The hrst four covariates are four-dimensional Gaussian with non-identity covariance 
matrix. The hfth covariate is drawn independently of the hrst four covariates uni¬ 
formly on the unit interval, i.e. X 5 ~ U{0, 1). The response, Y, is a linear combination 
of functions of the hve covariates plus noise: 

Fi = -.3W,i + - .75W,4 - + 6 <h[(W,i - Xi,95)/.35]W,5 + (11) 

where X 1..95 represents the .95 quantile of Xi and St ~ hd N(0, .00125). The idea is 
to create a response which has a nonlinear relationship with the covariates which only 
appears when conditions are extreme. Specihcally, the term 6 <h[(Xi 1 —Xi- .95)/ -SdjXt^s 
only contributes to Yt when Xt^i is extremely large. 

Figure shows scatterplots of the response variable versus each of the hve covari¬ 
ates; we focus on the relationships driving the large response values. Large values of 
Yt are clearly associated with large values of Xt^i. The quadratic term causes large 
values of the response to occur when X* 2 is between 0 and 1.5. X^ 3 does not seem to 
be related to large response values, and X^ 4’s relationship with the response appears 
roughly linear. Finally, the evidence of X^^s’s inhuence on the extreme behavior is 
slight, and its interaction with X^ 1 is not apparent from these plots. 
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X1 vs. Y 


X2 vs. Y 


X3 vs. Y 



- 3 - 2-10 1 2 3 


X4 vs. Y 




X5 vs. Y 




- 3 - 2-10 1 2 3 4 


0.0 0.2 0.4 0.6 0.8 1.0 


Figure 2 : Scatterplots for each of the hve covariates versus the response in the simu¬ 
lation study. All variables are on their original scales. 


4.2 Model Selection and Estimation 

To test our model selection procedure, we consider the models M1-M6 listed below. 
The hrst model includes all hve covariates. The models M2-M4 each leave out a single 
covariate: A 3 , A 2 , Ai respectively. Model hve leaves out A 3 but adds an interaction 
between Ai and A 5 . Model six adds (A 2 )^ to model hve. Note that none of these 


models corresponds exactly to equation (11), but that M 6 is closest that it includes 


some type of interaction between Ai and A 5 and the quadratic behavior of A 2 . 
Ml: xr = [A*i, A*2,^;,3,^;,4’^r.5]/3i 

M2: xr = [xr,xr,xr,xr]f32 
M3: xr = [xr,xr,xr,xr](33 
M4: xr = [xr,xr,xr,xr]f3. 
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M5: X” 
M 6 : X;* 


r V"* 

^ t ,25 ^£,45 ^£, 5 ’ "^£,1 

r T^* V"* 

["^£,15 ^£, 2 : ^£,45 ^£,55 ^£,1 


X -^t,5]/35 
X X*„ {Xl^f](3, 


We optimize the tail dependence with a smooth threshold where a = 1.25 and 
tq = 40, the .95 quantile of the radial components \\Zt\\. Marginal transformations 
from the original scale were based on the rank transform. Table gives 7 and the 
CV value for each of the six models. Comparing models Ml and M2 shows that the 
CV method is effective in protecting against overhtting. The overall score 7 is higher 
(worse) for M2 as it is a submodel of Ml; however, the CV score for M2 is better 
indicating that Xt^z is not useful for describing extreme response values. The CV 
score for M4 shows that leaving out Xi is clearly not a good idea as the CV value 
is by far the highest of the six models. M5, which adds the interaction of Xi and 
X 5 , gives a noticeably improved CV value, and M 6 , which is closest to the generating 
model has the best CV score. 

Table 1: The score (the optimized value of 7 ) and the 10-fold cross-validation value 
for each of the six models considered in the simulation study. 



Ml 

M2 

M3 

M4 

M5 

M 6 

7 

CV 

0.4930 

0.5052 

0.4950 

0.5017 

0.5125 

0.5240 

0.6053 

0.6196 

0.4603 

0.4703 

0.4060 

0.4120 


Figure gives a visual way to assess tail dependence associated with each model. 
For each model, x^*{f3*) is plotted versus y^*, where we use lower case to denote the 


realization from (11). Note that the large points for the models with lower 7 scores 
and CV values, like M5 and M 6 , occur in the interior of the positive orthant, whereas 
models with low 7 s (M4) have more points near the axes. 

Table [^reports the parameter estimates with nonparametric bootstrap (Efron and 


TibshiranH |1993[) standard errors for M 6 . Due to the constraint and the marginal 
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Model 1 


Model 2 


Model 3 







Figure 3: For each of the six models we consider in the simulation study, the scat- 
terplot of versus yl* is given. Models that yield a linear combination with a 

higher degree of tail dependence with the response will result in a scatterplot with 
large points closer to the identity line. Note the hgures have been zoomed in to show 
the [0, 200] X [0, 200] box to better illustrate the difference in behavior, but this view¬ 
ing window does exclude the largest values. Expanding the plotting range does not 
affect the qualitative behavior. 


transformations, only the relative magnitude and sign of the parameter estimates are 
interpretable. The estimate with largest magnitude corresponds to Xi, indicating 
that it plays an important role in describing extreme levels of the response. Standard 
errors indicate that the /3xi, PxiX^ and /3(X2)2 terms are signihcant, and the 
signs of these agree with the behavior shown in Figure The fact that is not 
signihcant is also sensible, as Xs’s only contribution in the generating equation ( [IT| ) 
is via the interaction with Xi. 
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Table 2: The parameter estimates for M 6 in the simulation study with bootstrap 


standard errors in parentheses. 


Coefficient 


PX2 

PX4 

i^Xs 

I^XiXs 

/3(X2)2 

Estimate 

.61 (. 12 ) 

.10 (. 10 ) 

-.22 (.07) 

-.19 (.14) 

.47 (.15) 

-.17 (.08) 


4.3 Data Mining Procedure 


In order to evaluate our automated model search technique outlined in Section 3.2 


we expand the simulation study. In addition to the seven covariates used in M1-M6, 
we randomly generate 100 independent iV(0,1) random vectors. We utilize the model 
search procedure based on simulated annealing using the full set of 107 variables. 

We start eight optimizations in parallel, using a randomly generated starting value 
for each. Since each string is given a set amount of time to search the model space, 
the algorithm will likely not reach the global optimum. Rather, we aim to determine 
if the good scoring models capture the known causes for extreme responses for this 
simulated data. 

Table [^gives the strings with the best CV scores in the simulated annealing based 


automated model search procedure. M 6 , from Section 4.2 has a CV score of .4120 


We note that the top three strings from the model search have CV scores that are 
close to this value. More importantly, these best scoring models clearly convey the 
driving factors for extreme behavior, as all include Xi, X 4 , a quadratic term for X 2 , 
and the Xi x X 5 interaction. 


Table 3: Cross validation scores and selected covariates of the best three models found 
by the automated model search algorithm. 


Model 

CV 

Variables 

M 6 

.4120 

V"* V* V* V* i V* ^2 

^ 

SA String 1 

.4253 

V* V* \/ f V* ^2 

^ 5 155 ^£,98 

SA String 2 

.4264 

\/ f ^2 y* y* 

"^£,15 ^£,4’^ ^£,55 5 "^£,665 "^£,79 

SA String 6 

.4271 

y* y* y* ^ y* / y* \2 y* y* 
^£,15 ^£,4’"^£.1 ^ "^£,55 l^£,2J 5 ^£,9’^£.84 
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Also within the simulation study, we assessed the algorithm’s sensitivity to the 
threshold smoothness, but found that doubling or halving the value of a had very 
little effect on parameter estimates. We also compare our results to other possible 
approaches, but note that none of the other approaches were designed to optimize 
tail dependence. We outperform regression approaches and extremes regression ap¬ 
proaches in terms of Kendall’s r applied to the joint tail. Details on both the sensi¬ 
tivity analysis and comparison to other methods can be found in the supplementary 
materials. 


5 Application to Ground Level Ozone Pollution 

We now employ our method in a data mining capacity in order to better understand 
the meteorological drivers of extreme ground level ozone. We analyze ozone data 
for Atlanta, Georgia and Charlotte, North Carolina because of their geographical 
proximity and consistent data records for ozone. 

5.1 Data 

An EPA websit^ provides ground level ozone data as well as data on other pollutants. 
We selected station 13-121-0055 in Atlanta and station 37-119-1005 in Charlotte be¬ 
cause of their long data records and relative lack of missing values. Although the ozone 
level is measured hourly at each station, the response variable we use is the maximum 
eight hour average ozone as it is a value on which United States National Ambient Air 
Quality Standards are based. The EPA-defined ozone season for North Carolina is 
April through October while the ozone season for Georgia is March through October. 
In our analysis, we use daily responses from April through October for both locations. 
'^See http://www.epa.gov/airdata/ad_maps.htinl 
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Our analysis is based on data from the years 1992 through 2010, providing a total of 
4,037 observations for Atlanta and 4,055 observations for Charlotte. 

Ground level ozone readings have been decreasing over most of the United States 
in recent years. On its website, the EPA reports that “nationally, average ozone lev¬ 
els declined in the 1980s, leveled off in the 1990s, and showed a notable decline after 
2002.”|3 This is a trend that we notice in our exploratory data analysis. To account 
for non-stationarity, we transform the response variable by partitioning the data into 
nonoverlapping four-year blocks. In each block, we £t a gamma distribution to the 
observations below the .95 quantile and a generalized Pareto distribution for obser¬ 
vations above the .95 quantile. We then use these estimated distribution functions to 
transform the response to unit Frechet via a probability integral transformation. We 
employ a parametric model for marginal transformation as a rank transform resulted 
in common values in the tail if blocks have the same number of observations. 

Not surprisingly, a seasonal effect is also apparent in the ozone data. Because our 
aim is to link extreme ozone levels to meteorological conditions, we do not deseason- 
alize the ozone data. Rather, we assume that the seasonal response is well captured 
by conditioning on meteorological variables which themselves exhibit seasonality. 

We obtain meteorological covariates from the North American Regional Reanalysis 
(NARR)p| ( [Mesinger et al. , 2006). The data are in gridded cells, approximately 30km 
by 30km in size, differing from our ozone data which correspond to point locations. 
Understanding the relationship between ozone and meteorological variables of large 
spatial scales is motivated by the larger project’s goal of investigating the simulation 
of air quality extremes in atmospheric chemistry models. The NARR provides a 
large number of meteorological variables. Based on guidance from the collaborating 


^See http://www.epa.gov/airtrends/ozone.html, 

^NARR data provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their 
Web site at http://www.esrl.noaa.gov/psd/ 
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atmospheric chemists, we initially use 18 NARR covariates in our models; however, 
we also consider these variables on different spatial scales and transformations of these 
variables leading to a longer overall list of possible covariates. We expect that many 
of these variables will be irrelevant in terms of explaining extreme ozone behavior. 
Furthermore, we aim to explore whether any interactions between covariates are useful 
to explain extreme ozone behavior. Hence, covariate selection is a primary goal of 
this study. 

We do not include information about the ozone precursors NOx and VOCs among 
our covariates. The EPA monitors NO and NO 2 , and although the NOx data record 


is not as extensive as it is for ozone, many studies (e.g., Eastoe (2009)) have found 
NOx measurements to be helpful when modeling ozone. The larger aim of our project 
is to provide information to improve atmospheric chemistry models which require 
information about NOx emissions, rather than measurements. NOx emissions are not 
known at the daily level, and are presumed by modelers to be relatively constant. Our 
aim is to link extreme ozone to meteorological conditions, and we believe that the 
daily variability found in NOx measurements is largely attributable to meteorology 
rather than fluctuations in emissions. 


5.1.1 Handling a Semi-continuous Covariate: Precipitation 

The method described in Section 3 requires continuous covariates to perform the 
two-step marginal transformation leading to We wish to investigate pre¬ 

cipitation’s effect on extreme ozone, but precipitation has a positive probability of 
being exactly zero. Exploratory analysis indicates that the presence of precipitation 
likely affects extreme ozone, but the amount of precipitation may not be important. 
Thus, we extend the method to account for variables like precipitation by including 
a precipitation indicator and interactions with this indicator. 
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In Section 3, it was critical that the distribution of X^^f3 be known for any 
f3, which we achieved with the constraint = 1. Including a precipitation 

indicator complicates matters, but we model in a way such that X^'^f3 will have a 
known distribution which is a mixture of Gaussians. 

Let Xt^p be the amount of precipitation at time t, where P{Xt^p = 0) > 0. Assume 
there are k + I — m total covariates included: k ‘main effects’ as before, and I effects 
to be included in precipitation interaction terms, m of which were already included 
as main effects. Including the precipitation covariate and interactions and letting 
Xt 1 ,... ,Xt^rn be the overlapping covariates changes equation ([^ to 


Xg/S = + + A',V, + ... + &A7_t+ (12) 


for 13 = (/3jp„,, = ((Di,..., A), (7y>, Dl''',.... As before, if Xy/3|(.Y,,p < 


iP)\\T 


c) ~ X(0,1) under the constraint = 1. However, given Xt^p > c, (12) 

becomes 


Xfl3\{X,,p>c) = (A + 0r’)A,:i + ... + (D„ + A',f>).Y,y + 

)lm+lA7y+i +-h PkXlp + + 

+ • • ■ + PPXlt+l-rs, (13) 

which is distributed N{/3q^\ where d/* is the covariance matrix of the contin¬ 
uous covariates and A = {{(3i +(/Sm + /3m+i, ■ ■ .,/3k,/3^li, ■ ■ 

Since XYf3 has a known distribution for any /3, we can proceed with the second 
transformation as before, employing sample covariance matrices and the observed 
mixture proportion. 
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5.2 Data Mining Procedure and Results 


As we are not able to fit all possible models, our data mining approach evolved 
as a multiple-step procedure, and we will discuss results at each step. We use the 


Yellowstone computing system (Computational and Information Systems Laboratory 


2012) to perform computations. Also, based on exploratory analysis, we employ a 


smooth threshold with mean equal to the .95 quantile of the radial components and 
a = 1.25. 


5.2.1 Model Exploration: Four Variable Models 


As a hrst step, we £t all possible models with up to four covariates because an 
exhaustive search of such models is possible. There are nearly 35,000 models of this 
type, and we fit all of the approximately 10,000 models that do not include highly 
correlated covariates. We allow the models to include the precipitation indicator and 
interactions between continuous covariates and the precipitation indicator, but do 
not consider interactions between continuous covariates. We dehne the precipitation 
indicator to be I{Xt^p > .Olin.}. 

Table reports the covariates in the five models with the lowest CV scores for 
Atlanta and Charlotte. In Atlanta, variables such as temperature, wind speed, down¬ 
ward shortwave radiative flux (dswrf, a measure of sunshine), the precipitation indi¬ 
cator, height of the planetary boundary layer (hpbl) at different times of day, and 
relative humidity seem to be in the best htting models. In Charlotte, similar variables 
appear along with northwest or west (shown as a negative coefficient on east) wind 
directions. Most of these variables are not surprising, as one would suspect high ozone 
to be associated with hot sunny days with low wind speeds. We view the results from 
this hrst step mostly as conhrmatory: the approach is choosing sensible covariates 
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when limited to only four variables at a time. That the precipitation indicator ap¬ 


pears is somewhat interesting as Jacob and Winner (2009) noted that precipitation 
has little effect on ground level ozone pollution. While precipitation may have little 
effect on mean levels of ground level ozone, it seems reasonable that the most extreme 
ozone levels do not occur on days where there is precipitation. 

We also note that the four-covariate models with the best CV scores are identical 
for Atlanta and Charlotte, and this allows us to compare the parameter estimates 
for these common models. Table also gives the parameter estimates and bootstrap 
standard errors (based on 640 bootstrap replications) for the top model. We note that 
the corresponding parameter estimates are similar and the signs of the parameter 
estimates are sensible. Both locations show that air temperature and downward 
shortwave radiation flux have a positive relationship with extreme ozone while wind 
speed and precipitation have negative relationships. Standard errors show there is 
large uncertainty associated with the parameter estimates, but our CV-based model 
selection procedure should protect us from identifying irrelevant covariates. 

It is clear that since we are able to include only four variables at a time, we 
have limited ability to explore the model space. However, results from this hrst step 
suggest a method for further exploration of the model space. 


5.2.2 ‘Core Plus Four’ Model Search 

The main result from the hrst model search stage is that high temperature, high 
sunshine, low wind speed, and lack of precipitation seem to provide good conditions 
for extreme ozone events. These conditions are largely explained by the four ‘core’ 
variables which appear in the best htting models for Atlanta and Charlotte: temper¬ 
ature, wind speed, dswrf, and the precipitation indicator. We continue our model 
search by htting all possible models which include these four variables, plus up to 
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Table 4: Covariates for the top five models containing four covariates for Atlanta and 
Charlotte are given along with their respective CV scores. Parameter estimates with 
bootstrap standard errors (based on 640 bootstrap replications) are reported for the 
top model at each location. 


Model 

CV 





Atlanta 1 

.5398 

temp 
.45 (.25) 

wnd spd 
-.44 (.36) 

dswrf 
.53 (.64) 

precip 
-5.82 (.20) 

Atlanta 2 

.5508 

temp 

.59 

wnd spd 
-.46 

rel hum 
-.28 

precip 

-5.46 

Atlanta 3 

.5512 

temp 

.80 

wnd spd 
-.39 

dswrf 

.34 

cape 

-.35 

Atlanta 4 

.5513 

temp 

.87 

wnd spd 
-.36 

hpbl 7am 
-.24 

cape 

-.28 

Atlanta 5 

.5519 

temp 

.53 

hpbl 7am 
-.39 

rel hum 
-.48 

precip 

-5.10 

Charlotte 1 

.5452 

temp 
.44 (.33) 

wnd spd 
-.50 (.42) 

dswrf 
.53 (.61) 

precip 
-5.86 (.43) 

Charlotte 2 

.5770 

temp 

.54 

wnd spd 
-.50 

NW wind 
.12 

dswrf 

.38 

Charlotte 3 

.5772 

temp 

.52 

wnd spd 
-.44 

dswrf 

.34 

rel hum 
-.19 

Charlotte 4 

.5774 

temp 

.54 

wnd spd 
-.46 

E wind 
-.10 

dswrf 

.45 

Charlotte 5 

.5781 

temp 

.55 

wnd spd 
-.51 

dswrf 

.46 

tcdc 

.15 


four additional main effects. A total of 534 models were £t at this stage. 

Results from the first stage also suggested slightly altering our list of covariates. 
One change that was made was to include only the minimum and maximum hpbl 
values rather than the 8 values recorded throughout each day. 

Including dswrf as a core variable led us to change how we dealt with sev¬ 
eral other cloud variables which we found to be strongly correlated with dswrf. 
We define a new variable which is residual to the information in dswrf. The new 
variable is a linear combination of five of the cloud variables in the NARR: x = 
[a^cdcon, a^cdiyr, a^icdc, a^mcdc, a^hcdc]^- Specifically, we find the parameter vector of unit 




length, a such that Va.T{a^x) = a^Sa is maximized and Cov{xdswTi,o,'^^) = 0. We 
estimate a via constrained optimization at several locations in the East and South¬ 
east United States (including Atlanta and Charlotte) and hnd a to be similar at all 
locations. Thus, we dehne the new cloud variable: 

^new.cloud •47X(;dcon ASXcdlyr •3'^^lcdc T -^^^vcicdc T •dSXjicdc- 

This new cloud variable can be loosely interpreted as a contrast between high level 
clouds (those with positive coefficients) and low level clouds (those with negative). 

Table 1^ compares the top hve ‘core-plus-four’ models at Atlanta and Charlotte to 
the core-only model. In Atlanta, we see a convincing drop in the CV scores of the 
best htting core-plus-four models compared to the core-only model. The top models 
tend to include minimum planetary boundary layer height (in 5 of the top 5 and 9 of 
the top 10), relative humidity (5/5 and 9/10), tropospheric height (3/5 and 5/10) and 
NE wind direction (3/5 and 4/10). The negative coefficients indicate that lower levels 
of planetary boundary layer height and relative humidity tend to be associated with 
extreme ozone, and the negative coefficient of the NE wind direction would indicate 
that extreme ozone tends to occur in Atlanta when wind is from the southwest. 

In Charlotte, we see a less convincing drop in the CV scores when we compare 
the best htting core-plus-four models to the core only model. However, there are 
some variables which are associated with most of the best htting models. The top 
models in Charlotte tend to include the new cloud variable (in 3 of the top 5 and 7 
of the top 10), tropospheric height (3/5 and 6/10), and E wind direction (3/5 and 
6/10). Interestingly, hpbl which seems to have a dare signal with extreme ozone in 
Atlanta, appears in few of the best htting Charlotte models. The diherences in the 
two cities’ variables may illustrate that diherent factors lead to extreme ozone in the 
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two cities, and the difference in the predominant wind direction may illustrate local 
differences in emissions sources. That the tropospheric height variable has a negative 
coefficient in Charlotte and a positive coefficient in Atlanta illustrates some of the 
difficulty in interpreting the parameter estimates. The difference in sign between the 
two cities may be due to the fact that in Atlanta, tropospheric height appears in 
models which also include planetary boundary layer height, whereas this was not the 
case in Charlotte. 

Table 5: CV scores for the core only model and the top five ‘core-plus-four’ models 
for both Atlanta and Charlotte. We also include the parameter estimates for the four 
non-core covariates._ 


Rank 

CV 

Covariates Added to Core Model 

Atl Core 

0.5398 





Atl 1 

0.5046 

hpbl min 
-0.34 

rel hum 
-0.19 

pres 

-0.12 

ht tropo 
0.22 

Atl 2 

0.5075 

hpbl min 

rel hum 

NE wnd 

ht tropo 



-0.30 

-0.17 

-0.16 

0.15 

Atl 3 

0.5083 

hpbl max 

hpbl min 

rel hum 

ht tropo 



-0.10 

-0.32 

-0.35 

0.16 

Atl 4 

0.5090 

hpbl min 

rel hum 

NE wnd 

pres 



-0.39 

-0.10 

-0.19 

-0.06 

Atl 5 

0.5097 

hpbl min 

rel hum 

NE wnd 

Iwrf 



-0.36 

-0.17 

-0.17 

0.08 

Char Core 0.5452 

Char 1 

0.5412 

E wnd 

pres 

Iwrf 

ht tropo 



-0.11 

0.08 

0.13 

-0.12 

Char 2 

0.5415 

cloud 

E wnd 

pres chng 

pres 



-0.11 

-0.09 

-0.04 

0.14 

Char 3 

0.5415 

cloud 

E wnd 

ht tropo 




-0.14 

-0.10 

-0.11 


Char 4 

0.5420 

hpbl max 

NW wnd 

Iwrf 

ht tropo 



-0.10 

0.06 

0.04 

-0.05 

Char 5 

0.5421 

cloud 

rel hum 

N wnd 

pres chng 



-0.06 

-0.21 

0.04 

-0.07 


Table gives bootstrapped standard errors for the best fitting models in Atlanta 
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and Charlotte. Because we are using a small subset of extreme data, and because 
these models include a large number of covariates which are likely dependent, it is 
not surprising that the standard errors are quite large. Because our aim is to uncover 
possible covariates for further exploration rather than to give a dehnitive model, we 
are not overly concerned with the large standard errors. 

Table 6: Parameter estimates of the best ‘core plus four’ models for Atlanta and 
Charlotte with bootstrap standard errors in parentheses. 


Atlanta 

temp 
.40 (.27) 
hpbl min 
-.34 (.47) 

wnd spd 
-.31 (.29) 
rel hum 
-.19 (.28) 

dswrf 
.29 (.50) 
pres 

-.12 (.26) 

precip 
-2.12 (.96) 
ht ropo 
.22 (.28) 

Charlotte 

temp 

wnd spd 

dswrf 

precip 


.46 (.31) 

-.41 (.34) 

.55 (.48) 

-4.35 (.89) 


E wnd 

pres 

Iwrf 

ht tropo 


-.11 (.39) 

.08 (.17) 

.13 (.24) 

-.12 (.27) 


5.2.3 Automated Model Search Procedure 

Our model search procedure thus far has been limited to at most eight main effects. 
We would like to further explore the model space to investigate whether interactions 
or a larger number of covariates would show even stronger tail dependence. Because a 
systematic model search becomes infeasible, we perform an automated model search 


utilizing our simulated annealing procedure described in Section 3.2 


Possible covariates include all the main effects considered in the previous ‘core plus 
four’ exploration, 77 interactions between continuous covariates, and 15 interactions 
between continuous covariates and the precipitation indicator. We include the four 
core variables in all considered models, as this reduces the search to a region of the 
model space where extremes are known to occur. Starting values are chosen by using 
the best core plus four models at each location. We perform 640 runs at each location. 
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The simulated annealing procedure requires a choice for the function /. We describe 
our choice in the Supplementary Materials. 

The covariates in the top hve models at each location are given in Table In 
Atlanta, and to a greater extent in Charlotte, we see a convincing drop between the 
CV scores of the best htting models found during the model search and the best core- 
plus-four model from the previous section. In both cities, we see relative humidity 
appears in many of the top models, although in Charlotte it tends to appear in an 
interaction. Hpbl, which as a main effect did not appear in many of the best htting 
Charlotte models in the previous section, now appears in all listed models, often in 
an interaction. We further notice that many of the interactions in these top models 
include include a core variable such as wind speed or downward short wave radia¬ 
tive hux, and a planetary boundary layer height or pressure variable. Interestingly, 
few of these interactions include air temperature. We also notice that just one of 
these top models contains an interaction with the precipitation indicator, which may 
suggest that the presence of precipitation, regardless of other variables, is enough to 
discourage the most extreme ozone events. 

6 Summary and Discussion 

In this work, we present an atypical multivariate EV study. Rather than aiming to 
assess the probability associated with rare events, we use EV methods to learn about 
the processes which lead to extreme behavior. Specihcally, we use the framework of 
multivariate regular variation to hnd functions of covariates which exhibit strong tail 
dependence with the response. We employ a multistep data mining procedure where 
each step built on what was learned from the previous one, and which culminates 
in an automated model search procedure. The unique aspect of this study requires 
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Table 7: The covariates and interactions in the best hve models in the automated 
model search applied to the Atlanta (top) and Charlotte data (bottom). Interactions 
between continuous covariates are indicated with a ‘x’. The four core main effects 
were also included in all models, but do not appear in the covariate list. 


Rank 

CV 




Atl C+4 

0.5046 

best core-plus-four model 

Atl 1 

0.4812 

rel hum 

dswrfxNE wnd 

wnd spdxpres chg 
dswrf X pres 

wnd spdxht tropo 
dswrf X hpbl. max 

Atl 2 

0.4823 

hpbl min 
dswrfxNE wnd 

rel hum 

wnd spdxhpbl min 

ht tropo 

N wnd xprecip 

Atl 3 

0.4836 

pres 

wnd spdxht tropo 

wnd spdxNE wnd 
dswrf X pres 

wnd spdxpres 
rel humxNW wnd 

Atl 4 

0.4837 

wnd spdxht tropo 
wnd spdxhpbl min 

dswrfxht tropo 
dswrf X hpbl min 

hpbl minxNE wnd 
hpbl minx rel hum 

Atl 5 

0.4868 

rel hum 

new. cloud X pres 

wnd spdxpres chg 
temp X hpbl max 

wnd spdxht tropo 
dswrf X hpbl max 

Char C-l-4 

0.5412 

best core-plus-four model 

Char 1 

0.5085 

hpbl. max 

wnd spdxhpbl min 

pres chg 

dswrf X hpbl.max 

rel humxlwrf 
hpbl.maxxrel hum 

Char 2 

0.5172 

hpbl min 

wnd spdxhpbl min 

rel hum 

dswrf X hpbl.max 

hpbl.maxxNW wnd 
dswrf X hpbl min 

Char 3 

0.5175 

dswrfxNE wnd 
wnd spdxhpbl min 

rel humxlwrf 
dswrf X hpbl min 

temp X hpbl min 
dswrf X rel hum 

Char 4 

0.5177 

dswrfxpres 
rel humxlwrf 

hpbl.maxxpres 
wnd spdxhpbl min 

hpbl. max xht tropo 
dswrf X new. cloud 

Char 5 

0.5181 

dswrfxNE wnd 
hpbl minxht tropo 

dswrf X pres 
new.cloud xE wnd 

hpbl minx pres 
wnd spdxhpbl min 


novel considerations for extremes such as which tail dependence summary measures 
are suitable for optimization and the implementation of a smooth threshold. Fitting 
and performing cross validation on literally thousands of models required large-scale 
computational resources. 

Our method arose from an applied problem which sought to investigate the meteo¬ 
rological conditions associated with the most extreme ozone levels, which in turn will 
help atmospheric chemists understand what distinguishes an extreme ozone day from 
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a day with merely high ozone levels. Not surprisingly, our results show that high air 
temperature, low wind speed, and high sunlight are influential in producing extreme 
ozone events. However, it appears that covariates besides these likely ones also play 
a role in distinguishing extreme ozone days. The models we found that exhibited 
the strongest tail dependence tended to include interactions between the likely co¬ 
variates and other covariates, and the fact that the best fitting models include many 
interactions speaks to the complex relationship between ozone and meteorological 
conditions. Our analysis uncovers local effects such as which wind direction seems to 
be associated with extreme ozone levels in each of the cities. We do not end up with 
a single best model, as there is not likely one unique set of conditions which leads to 
extreme behavior. Rather, we provide our collaborating atmospheric chemists with a 
list of possible contributing factors for further investigation. That our approach is en¬ 
tirely data-driven and does not include any of the physical and chemical mechanisms 
which lead to ozone creation means that it provides a view of the drivers of extreme 
ozone independent from the view given by current atmospheric chemistry models. 

There are several avenues for further development. The meteorological covariates 
we consider are quite dependent and issues similar to the notion of colinearity warrant 
further consideration. We would like to investigate methods which pool information 
across stations, as the standard errors show that extracting a signal from individual 
locations is not easy. A spatial extension to our model could reduce uncertainty and 
provide additional understanding of how the meteorological drivers of extreme ozone 
differ over a larger spatial domain. As our data mining procedure is closely tied to 
model selection, one could consider extending the optimization procedure to penalize 


for complexity (e.g. LASSO Tibshirani (1996)). However, since our aim is to tease 
out potential meteorological drivers of extreme ozone for further study, penalizing 
model complexity is not a primary concern in the optimization phase. 
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